Chapter 20 

Random Variables 



So far we have focused on probabilities of events: that you win the Monty Hall
game, or that you have a rare medical condition given that you tested positive.
Now we focus on quantitative questions: How many contestants must play the Monty
Hall game until one of them finally wins? How long will this condition last? How
much will I lose playing 6.042 games all day? Random variables are the
mathematical tool for addressing such questions.



20.1 Random Variable Examples 

Definition 20.1.1. A random variable, R, on a probability space is a total function 
whose domain is the sample space. 

The codomain of R can be anything, but will usually be a subset of the real 
numbers. Notice that the name "random variable" is a misnomer; random vari- 
ables are actually functions! 

For example, suppose we toss three independent, unbiased coins. Let C be the
number of heads that appear. Let M = 1 if the three coins come up all heads or all
tails, and let M = 0 otherwise. Now every outcome of the three coin flips uniquely
determines the values of C and M. For example, if we flip heads, tails, heads, then
C = 2 and M = 0. If we flip tails, tails, tails, then C = 0 and M = 1. In effect, C
counts the number of heads, and M indicates whether all the coins match.

Since each outcome uniquely determines C and M, we can regard them as 
functions mapping outcomes to numbers. For this experiment, the sample space 
is: 

S = { HHH, HHT, HTH, HTT, THH, THT, TTH, TTT} . 

Now C is a function that maps each outcome in the sample space to a number as
follows:



C(HHH) = 3 C(THH) = 2 

C(HHT) = 2 C(THT) = 1 

C(HTH) = 2 C(TTH) = 1 

C(HTT) = 1 C(TTT) = 0. 



Similarly, M is a function mapping each outcome another way: 



M(HHH) = 1 M(THH) = 0

M(HHT) = 0 M(THT) = 0

M(HTH) = 0 M(TTH) = 0

M(HTT) = 0 M(TTT) = 1.



So C and M are random variables. 
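To make the "random variables are functions" point concrete, here is a minimal Python sketch of the three-coin experiment. The names SAMPLE_SPACE, C, and M are chosen for this illustration; outcomes are represented as strings such as "HTH", which is an assumption of the sketch rather than anything required by the definition.

```python
from itertools import product

# The sample space for three coin flips: 'HHH', 'HHT', ..., 'TTT'.
SAMPLE_SPACE = ["".join(flips) for flips in product("HT", repeat=3)]

def C(outcome):
    """Random variable C: the number of heads in the outcome."""
    return outcome.count("H")

def M(outcome):
    """Random variable M: 1 if all three coins match, 0 otherwise."""
    return 1 if outcome in ("HHH", "TTT") else 0

# Each outcome determines the values of C and M.
for outcome in SAMPLE_SPACE:
    print(outcome, C(outcome), M(outcome))
```

Both C and M are total functions on the sample space: every one of the eight outcomes is assigned exactly one value.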

20.1.1 Indicator Random Variables 

An indicator random variable is a random variable that maps every outcome to
either 0 or 1. These are also called Bernoulli variables. The random variable M is
an example. If all three coins match, then M = 1; otherwise, M = 0.

Indicator random variables are closely related to events. In particular, an
indicator partitions the sample space into those outcomes mapped to 1 and those
outcomes mapped to 0. For example, the indicator M partitions the sample space
into two blocks as follows:

    {HHH, TTT}                        (M = 1)
    {HHT, HTH, HTT, THH, THT, TTH}    (M = 0)

In the same way, an event, E, partitions the sample space into those outcomes
in E and those not in E. So E is naturally associated with an indicator random
variable, I_E, where I_E(p) = 1 for outcomes p ∈ E and I_E(p) = 0 for outcomes
p ∉ E. Thus, M = I_F where F is the event that all three coins match.

20.1.2 Random Variables and Events

There is a strong relationship between events and more general random variables
as well. A random variable that takes on several values partitions the sample space
into several blocks. For example, C partitions the sample space as follows:

    {TTT}              (C = 0)
    {TTH, THT, HTT}    (C = 1)
    {THH, HTH, HHT}    (C = 2)
    {HHH}              (C = 3)

Each block is a subset of the sample space and is therefore an event. Thus, we
can regard an equation or inequality involving a random variable as an event. For
example, the event that C = 2 consists of the outcomes THH, HTH, and HHT.
The event C ≤ 1 consists of the outcomes TTT, TTH, THT, and HTT.

Naturally enough, we can talk about the probability of events defined by
properties of random variables. For example,

\[
\Pr\{C = 2\} = \Pr\{\text{THH}\} + \Pr\{\text{HTH}\} + \Pr\{\text{HHT}\}
             = \frac{1}{8} + \frac{1}{8} + \frac{1}{8} = \frac{3}{8}.
\]

20.1.3 Independence 

The notion of independence carries over from events to random variables as well.
Random variables R_1 and R_2 are independent iff for all x_1 in the codomain of R_1,
and x_2 in the codomain of R_2, we have:

\[
\Pr\{R_1 = x_1 \text{ AND } R_2 = x_2\} = \Pr\{R_1 = x_1\} \cdot \Pr\{R_2 = x_2\}.
\]

As with events, we can formulate independence for random variables in an
equivalent and perhaps more intuitive way: random variables R_1 and R_2 are
independent if for all x_1 and x_2

\[
\Pr\{R_1 = x_1 \mid R_2 = x_2\} = \Pr\{R_1 = x_1\}
\]

whenever the lefthand conditional probability is defined, that is, whenever
Pr{R_2 = x_2} > 0.

As an example, are C and M independent? Intuitively, the answer should be
"no". The number of heads, C, completely determines whether all three coins
match; that is, whether M = 1. But, to verify this intuition, we must find some
x_1, x_2 ∈ ℝ such that:

\[
\Pr\{C = x_1 \text{ AND } M = x_2\} \neq \Pr\{C = x_1\} \cdot \Pr\{M = x_2\}.
\]

One appropriate choice of values is x_1 = 2 and x_2 = 1. In this case, we have:

\[
\Pr\{C = 2 \text{ AND } M = 1\} = 0 \neq \frac{1}{4} \cdot \frac{3}{8}
  = \Pr\{M = 1\} \cdot \Pr\{C = 2\}.
\]

The first probability is zero because we never have exactly two heads (C = 2) when
all three coins match (M = 1). The other two probabilities were computed earlier.

On the other hand, let H_1 be the indicator variable for the event that the first
flip is a Head, so

    [H_1 = 1] = {HHH, HTH, HHT, HTT}.

Then H_1 is independent of M, since

\[
\begin{aligned}
\Pr\{M = 1\} &= 1/4 = \Pr\{M = 1 \mid H_1 = 1\} = \Pr\{M = 1 \mid H_1 = 0\}, \\
\Pr\{M = 0\} &= 3/4 = \Pr\{M = 0 \mid H_1 = 1\} = \Pr\{M = 0 \mid H_1 = 0\}.
\end{aligned}
\]

This example is an instance of a simple lemma:

Lemma 20.1.2. Two events are independent iff their indicator variables are independent.

As with events, the notion of independence generalizes to more than two random
variables.

Definition 20.1.3. Random variables R_1, R_2, ..., R_n are mutually independent iff

\[
\Pr\{R_1 = x_1 \text{ AND } R_2 = x_2 \text{ AND } \cdots \text{ AND } R_n = x_n\}
  = \Pr\{R_1 = x_1\} \cdot \Pr\{R_2 = x_2\} \cdots \Pr\{R_n = x_n\}
\]

for all x_1, x_2, ..., x_n.

It is a simple exercise to show that the probability that any subset of the variables
takes a particular set of values is equal to the product of the probabilities that the
individual variables take their values. Thus, for example, if R_1, R_2, ..., R_100 are
mutually independent random variables, then it follows that:

\[
\Pr\{R_1 = 7 \text{ AND } R_7 = 9.1 \text{ AND } R_{23} = \pi\}
  = \Pr\{R_1 = 7\} \cdot \Pr\{R_7 = 9.1\} \cdot \Pr\{R_{23} = \pi\}.
\]
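The pairwise independence checks in this section can be reproduced by brute-force enumeration over the eight coin-flip outcomes. The following is a sketch under the assumption that outcomes are equally likely strings such as "HTH"; the helper names pr and independent are made up for this illustration. It checks the product rule only over values in the range of each variable, which suffices because values outside the range have probability zero on both sides.

```python
from fractions import Fraction
from itertools import product

OUTCOMES = ["".join(flips) for flips in product("HT", repeat=3)]

def pr(event):
    """Probability of an event, given as a predicate on equally likely outcomes."""
    return Fraction(sum(1 for w in OUTCOMES if event(w)), len(OUTCOMES))

def independent(X, Y):
    """True iff Pr{X = x AND Y = y} = Pr{X = x} * Pr{Y = y} for all x, y."""
    xs = {X(w) for w in OUTCOMES}
    ys = {Y(w) for w in OUTCOMES}
    return all(
        pr(lambda w: X(w) == x and Y(w) == y)
        == pr(lambda w: X(w) == x) * pr(lambda w: Y(w) == y)
        for x in xs
        for y in ys
    )

def C(w):
    return w.count("H")                      # number of heads

def M(w):
    return 1 if w in ("HHH", "TTT") else 0   # all coins match

def H1(w):
    return 1 if w[0] == "H" else 0           # first flip is a head

print(independent(C, M))    # C determines M, so the product rule fails
print(independent(H1, M))   # the first flip says nothing about matching
```

Using exact fractions rather than floating point avoids spurious failures of the equality test.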



20.2 Probability Distributions 

A random variable maps outcomes to values, but random variables that show up 
for different spaces of outcomes wind up behaving in much the same way because 
they have the same probability of taking any given value. Namely, random vari- 
ables on different probability spaces may wind up having the same probability 
density function. 

Definition 20.2.1. Let R be a random variable with codomain V. The probability
density function (pdf) of R is a function PDF_R : V → [0, 1] defined by:

\[
\mathrm{PDF}_R(x) ::=
\begin{cases}
\Pr\{R = x\} & \text{if } x \in \operatorname{range}(R), \\
0 & \text{if } x \notin \operatorname{range}(R).
\end{cases}
\]

A consequence of this definition is that

\[
\sum_{x \in \operatorname{range}(R)} \mathrm{PDF}_R(x) = 1.
\]



This follows because R has a value for each outcome, so summing the probabilities 
over all outcomes is the same as summing over the probabilities of each value in 
the range of R. 

As an example, let's return to the experiment of rolling two fair, independent 
dice. As before, let T be the total of the two rolls. This random variable takes on 
values in the set V = {2, 3, . . . , 12}. A plot of the probability density function is 
shown below: 



[Figure: bar plot of the probability density function of T against x ∈ V, for
x = 2, 3, ..., 12; the vertical axis is marked at 3/36 and 6/36.]

The lump in the middle indicates that sums close to 7 are the most likely. The total
area of all the rectangles is 1 since the dice must take on exactly one of the sums in
V = {2, 3, ..., 12}.

A closely-related idea is the cumulative distribution function (cdf) for a random
variable R whose codomain is real numbers. This is a function CDF_R : ℝ → [0, 1]
defined by:

\[
\mathrm{CDF}_R(x) = \Pr\{R \le x\}
\]

As an example, the cumulative distribution function for the random variable T is 
shown below: 

[Figure: bar plot of the cumulative distribution function of T against x ∈ V, for
x = 2, 3, ..., 12; the vertical axis is marked at 1/2 and 1.]

The height of the i-th bar in the cumulative distribution function is equal to the 
sum of the heights of the leftmost i bars in the probability density function. This 
follows from the definitions of pdf and cdf: 

\[
\begin{aligned}
\mathrm{CDF}_R(x) &= \Pr\{R \le x\} \\
                  &= \sum_{y \le x} \Pr\{R = y\} \\
                  &= \sum_{y \le x} \mathrm{PDF}_R(y).
\end{aligned}
\]

In summary, PDF_R(x) measures the probability that R = x and CDF_R(x) measures
the probability that R ≤ x. Both PDF_R and CDF_R capture the same information
about the random variable R (you can derive one from the other), but sometimes
one is more convenient. The key point here is that neither the probability density
function nor the cumulative distribution function involves the sample space of an
experiment.
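As a small illustration of how the two functions determine each other, the sketch below tabulates the density of the two-dice total T and builds the cumulative distribution from it. The names pdf and cdf are chosen for this example, and exact fractions are used so that the identities hold without rounding.

```python
from fractions import Fraction
from itertools import product

# PDF_T(x): probability that two fair, independent dice total x.
pdf = {}
for a, b in product(range(1, 7), repeat=2):
    pdf[a + b] = pdf.get(a + b, Fraction(0)) + Fraction(1, 36)

def cdf(x):
    """CDF_T(x) = Pr{T <= x}, the sum of PDF_T(y) over all y <= x."""
    return sum((p for y, p in pdf.items() if y <= x), Fraction(0))

print(pdf[7], cdf(7), cdf(12))

# Going the other way: each density value is a difference of consecutive cdf values.
recovered = {x: cdf(x) - cdf(x - 1) for x in range(2, 13)}
print(recovered == pdf)
```

Note that nothing in pdf or cdf refers to the 36 underlying outcomes once the table has been built, which is the point made above.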

We'll now look at three important distributions and some applications. 

20.2.1 Bernoulli Distribution 

Indicator random variables are perhaps the most common type because of their 
close association with events. The probability density function of an indicator ran- 
dom variable, B, is always 

\[
\mathrm{PDF}_B(0) = p, \qquad \mathrm{PDF}_B(1) = 1 - p
\]

where 0 ≤ p ≤ 1. The corresponding cumulative distribution function is:

\[
\mathrm{CDF}_B(0) = p, \qquad \mathrm{CDF}_B(1) = 1
\]

20.2.2 Uniform Distribution 

A random variable that takes on each possible value with the same probability is 
called uniform. For example, the probability density function of a random variable 
U that is uniform on the set {1,2,..., N} is: 

\[
\mathrm{PDF}_U(k) = \frac{1}{N}
\]

And the cumulative distribution function is:

\[
\mathrm{CDF}_U(k) = \frac{k}{N}
\]

Uniform distributions come up all the time. For example, the number rolled on a 
fair die is uniform on the set {1, 2, ... , 6}. 

20.2.3 The Numbers Game 

Let's play a game! I have two envelopes. Each contains an integer in the range 
0, 1, . . . , 100, and the numbers are distinct. To win the game, you must determine 
which envelope contains the larger number. To give you a fighting chance, I'll let 
you peek at the number in one envelope selected at random. Can you devise a 
strategy that gives you a better than 50% chance of winning? 

For example, you could just pick an envelope at random and guess that it
contains the larger number. But this strategy wins only 50% of the time. Your
challenge is to do better.

So you might try to be more clever. Suppose you peek in the left envelope and
see the number 12. Since 12 is a small number, you might guess that the other
number is larger. But perhaps I'm sort of tricky and put small numbers in both
envelopes. Then your guess might not be so good!

An important point here is that the numbers in the envelopes may not be ran- 
dom. I'm picking the numbers and I'm choosing them in a way that I think will 
defeat your guessing strategy. I'll only use randomization to choose the numbers 
if that serves my end: making you lose! 

Intuition Behind the Winning Strategy 

Amazingly there is a strategy that wins more than 50% of the time, regardless of 
what numbers I put in the envelopes! 

Suppose that you somehow knew a number x between my lower and higher
numbers. Now you peek in an envelope and see one or the other. If it is
bigger than x, then you know you're peeking at the higher number. If it is smaller 
than x, then you're peeking at the lower number. In other words, if you know a 
number x between my lower and higher numbers, then you are certain to win the 
game. 

The only flaw with this brilliant strategy is that you do not know x. Oh well. 

But what if you try to guess x? There is some probability that you guess cor- 
rectly. In this case, you win 100% of the time. On the other hand, if you guess 
incorrectly, then you're no worse off than before; your chance of winning is still 
50%. Combining these two cases, your overall chance of winning is better than 
50%! 

Informal arguments about probability, like this one, often sound plausible, but 
do not hold up under close scrutiny. In contrast, this argument sounds completely 
implausible — but is actually correct! 

Analysis of the Winning Strategy 

For generality, suppose that I can choose numbers from the set {0, 1, ... , n}. Call 
the lower number L and the higher number H. 

Your goal is to guess a number x between L and H. To avoid confusing equality
cases, you select x at random from among the half-integers:

\[
\left\{ \tfrac{1}{2},\ 1\tfrac{1}{2},\ 2\tfrac{1}{2},\ \ldots,\ n - \tfrac{1}{2} \right\}
\]

But what probability distribution should you use?

The uniform distribution turns out to be your best bet. An informal justification
is that if I figured out that you were unlikely to pick some number, say 50 1/2,
then I'd always put 50 and 51 in the envelopes. Then you'd be unlikely to pick an x
between L and H and would have less chance of winning.

After you've selected the number x, you peek into an envelope and see some
number p. If p > x, then you guess that you're looking at the larger number. If
p < x, then you guess that the other number is larger.

All that remains is to determine the probability that this strategy succeeds. We 
can do this with the usual four step method and a tree diagram. 
Step 1: Find the sample space. You either choose x too low (< L), too high (> H), 
or just right (L < x < H). Then you either peek at the lower number (p = L) or the 
higher number (p = H). This gives a total of six possible outcomes. 



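choice of x                    # peeked at       result    probability

x too low     (L/n)            p = L  (1/2)      lose      L/2n
                               p = H  (1/2)      win       L/2n
x just right  ((H − L)/n)      p = L  (1/2)      win       (H − L)/2n
                               p = H  (1/2)      win       (H − L)/2n
x too high    ((n − H)/n)      p = L  (1/2)      win       (n − H)/2n
                               p = H  (1/2)      lose      (n − H)/2n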



Step 2: Define events of interest. The four outcomes in the event that you win 
are marked in the tree diagram. 

Step 3: Assign outcome probabilities. First, we assign edge probabilities. Your 
guess x is too low with probability L/n, too high with probability (n — H)/n, and 
just right with probability (H — L) /n. Next, you peek at either the lower or higher 
number with equal probability. Multiplying along root-to-leaf paths gives the out- 
come probabilities. 

Step 4: Compute event probabilities. The probability of the event that you win 
is the sum of the probabilities of the four outcomes in that event: 



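\[
\begin{aligned}
\Pr\{\text{win}\} &= \frac{L}{2n} + \frac{H - L}{2n} + \frac{H - L}{2n} + \frac{n - H}{2n} \\
                  &= \frac{1}{2} + \frac{H - L}{2n} \\
                  &\ge \frac{1}{2} + \frac{1}{2n}.
\end{aligned}
\]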

The final inequality relies on the fact that the higher number H is at least 1 greater 
than the lower number L since they are required to be distinct. 
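The four-step calculation can also be checked exhaustively for small n. The sketch below computes the exact winning probability of the strategy for given envelope contents by enumerating every half-integer choice of x and both envelopes you might peek into. The function name win_probability and its parameters are hypothetical names for this illustration; the strategy itself is the one described above.

```python
from fractions import Fraction

def win_probability(low, high, n):
    """Exact chance of winning when the envelopes hold low < high in {0, ..., n}.

    x is uniform over the half-integers 1/2, 3/2, ..., n - 1/2, and the
    envelope peeked at is chosen by a fair coin flip.
    """
    wins = Fraction(0)
    for j in range(n):
        x = Fraction(2 * j + 1, 2)                     # x = j + 1/2
        for peeked, other in ((low, high), (high, low)):
            guess_peeked_is_larger = peeked > x        # the strategy's guess
            if guess_peeked_is_larger == (peeked > other):
                wins += Fraction(1, n) * Fraction(1, 2)
    return wins

n = 10
worst = min(
    win_probability(low, high, n)
    for low in range(n + 1)
    for high in range(low + 1, n + 1)
)
print(worst)
```

Taking the minimum over all pairs plays the role of the adversary who picks the numbers: the worst case is a pair of adjacent integers, where only one value of x falls between them.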

Sure enough, you win with this strategy more than half the time, regardless
of the numbers in the envelopes! For example, if I choose numbers in the range
0, 1, ..., 100, then you win with probability at least 1/2 + 1/200 = 50.5%. Even
better, if I'm allowed only numbers in the range 0, ..., 10, then your probability
of winning rises to 55%! By Las Vegas standards, those are great odds!

20.2.4 Binomial Distribution

The binomial distribution plays an important role in Computer Science as it does in
most other sciences. The standard example of a random variable with a binomial
distribution is the number of heads that come up in n independent flips of a coin;
call this random variable H_n. If the coin is fair, then H_n has an unbiased binomial
density function:

\[
\mathrm{PDF}_{H_n}(k) = \binom{n}{k} 2^{-n}.
\]

This follows because there are (n choose k) sequences of n coin tosses with exactly
k heads, and each such sequence has probability 2^(−n).

Here is a plot of the unbiased probability density function PDF_{H_n}(k)
corresponding to n = 20 coin flips. The most likely outcome is k = 10 heads, and the
probability falls off rapidly for larger and smaller values of k. These falloff regions
to the left and right of the main hump are usually called the tails of the distribution.




In many fields, including Computer Science, probability analyses come down to 
getting small bounds on the tails of the binomial distribution. In the context of a 
problem, this typically means that there is very small probability that something 
bad happens, which could be a server or communication link overloading or a 
randomized algorithm running for an exceptionally long time or producing the 
wrong result. 

As an example, we can calculate the probability of flipping at most 25 heads in 
100 tosses of a fair coin and see that it is very small, namely, less than 1 in 3,000,000. 

In fact, the tail of the distribution falls off so rapidly that the probability of
flipping exactly 25 heads is about twice the probability of flipping fewer than 25
heads! That is, the probability of flipping exactly 25 heads, small as it is, is
still about twice as large as the probability of flipping exactly 24 heads plus the
probability of flipping exactly 23 heads plus ... the probability of flipping no heads.
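These tail claims are easy to check with exact arithmetic. The sketch below evaluates the unbiased binomial density for n = 100 directly from the formula above; the helper name unbiased_pdf is an assumption of this example, and math.comb requires Python 3.8 or later.

```python
from fractions import Fraction
from math import comb

def unbiased_pdf(n, k):
    """Density of H_n at k for a fair coin: (n choose k) / 2^n."""
    return Fraction(comb(n, k), 2 ** n)

at_most_25 = sum(unbiased_pdf(100, k) for k in range(26))
exactly_25 = unbiased_pdf(100, 25)
fewer_than_25 = at_most_25 - exactly_25

print(float(at_most_25))                   # tail probability Pr{H_100 <= 25}
print(float(exactly_25 / fewer_than_25))   # how much of the tail sits at k = 25
```

The first number confirms that the tail is below 1 in 3,000,000, and the second shows that the single term k = 25 outweighs all the terms below it combined by a factor of about two.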



The General Binomial Distribution 

Now let J be the number of heads that come up on n independent coins, each of
which is heads with probability p. Then J has a general binomial density function:

\[
\mathrm{PDF}_J(k) = \binom{n}{k} p^k (1 - p)^{n-k}.
\]

As before, there are (n choose k) sequences with k heads and n − k tails, but now
the probability of each such sequence is p^k (1 − p)^(n−k).

As an example, the plot below shows the probability density function PDF_J(k)
corresponding to flipping n = 20 independent coins that are heads with probability
p = 0.75. The graph shows that we are most likely to get around k = 15 heads,
as you might expect. Once again, the probability falls off quickly for larger and
smaller values of k.

[Figure: plot of PDF_J(k) for n = 20 and p = 0.75; the vertical axis is marked at 0.25.]

20.2.5 Problems

Class Problems



Guess the Bigger Number Game 

Team 1: 

• Write different integers between 0 and 7 on two pieces of paper.

• Put the papers face down on a table. 

Team 2: 

• Turn over one paper and look at the number on it. 

• Either stick with this number or switch to the unseen other number. 

Team 2 wins if it chooses the larger number. 



Problem 20.1. 

In section 20.2.3, Team 2 was shown to have a strategy that wins 4/7 of the time 
no matter how Team 1 plays. Can Team 2 do better? The answer is "no," because 
Team 1 has a strategy that guarantees that it wins at least 3/7 of the time, no matter 
how Team 2 plays. Describe such a strategy for Team 1 and explain why it works. 



Problem 20.2. 

Suppose X_1, X_2, and X_3 are three mutually independent random variables, each
having the uniform distribution

    Pr{X_i = k} equal to 1/3 for each of k = 1, 2, 3.

Let M be another random variable giving the maximum of these three random
variables. What is the density function of M?

Homework Problems 
Problem 20.3. 

A drunken sailor wanders along main street, which conveniently consists of the
points along the x axis with integral coordinates. In each step, the sailor moves
one unit left or right along the x axis. A particular path taken by the sailor can be
described by a sequence of "left" and "right" steps. For example, (left, left, right)
describes the walk that goes left twice then goes right.

We model this scenario with a random walk graph whose vertices are the
integers and with edges going in each direction between consecutive integers. All
edges are labelled 1/2.

The sailor begins his random walk at the origin. This is described by an initial
distribution which labels the origin with probability 1 and all other vertices with
probability 0. After one step, the sailor is equally likely to be at location 1 or −1,
so the distribution after one step gives label 1/2 to the vertices 1 and −1 and labels
all other vertices with probability 0.

(a) Give the distributions after the 2nd, 3rd, and 4th step by filling in the table 
of probabilities below, where omitted entries are 0. For each row, write all the 
nonzero entries so they have the same denominator. 

location         −4   −3   −2   −1    0    1    2    3    4

initially                             1
after 1 step                    1/2   0   1/2
after 2 steps              ?    ?     ?    ?    ?
after 3 steps         ?    ?    ?     ?    ?    ?    ?
after 4 steps    ?    ?    ?    ?     ?    ?    ?    ?    ?

(b) 



1. What is the final location of a t-step path that moves right exactly i times? 

2. How many different paths are there that end at that location? 

3. What is the probability that the sailor ends at this location? 

(c) Let L be the random variable giving the sailor's location after t steps, and let 
B ::= (L + t)/2. Use the answer to part (b) to show that B has an unbiased binomial 
density function. 

(d) Again let L be the random variable giving the sailor's location after t steps,
where t is even. Show that

\[
\Pr\left\{ |L| < \frac{\sqrt{t}}{2} \right\} < \frac{1}{2}.
\]

So there is a better than even chance that the sailor ends up at least √t/2 steps from
where he started.

Hint: Work in terms of B. Then you can use an estimate that bounds the binomial
distribution. Alternatively, observe that the origin is the most likely final location
and then use the asymptotic estimate

\[
\Pr\{L = 0\} = \Pr\{B = t/2\} \sim \sqrt{\frac{2}{\pi t}}.
\]

20.3 Average & Expected Value

The expectation of a random variable is its average value, where each value is 
weighted according to the probability that it comes up. The expectation is also 
called the expected value or the mean of the random variable. 

For example, suppose we select a student uniformly at random from the class, 
and let R be the student's quiz score. Then E [R] is just the class average — the first 
thing everyone wants to know after getting their test back! For similar reasons, 
the first thing you usually want to know about a random variable is its expected 
value. 

Definition 20.3.1.

\[
\mathrm{E}[R] ::= \sum_{x \in \operatorname{range}(R)} x \cdot \Pr\{R = x\} \tag{20.1}
\]

Let's work through an example. Let R be the number that comes up on a fair,
six-sided die. Then by (20.1), the expected value of R is:

\[
\begin{aligned}
\mathrm{E}[R] &= \sum_{k=1}^{6} k \cdot \frac{1}{6} \\
  &= 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6}
     + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} \\
  &= \frac{7}{2}.
\end{aligned}
\]

This calculation shows that the name "expected value" is a little misleading; the
random variable might never actually take on that value. You don't ever expect to
roll a 3 1/2 on an ordinary die!

There is an even simpler formula for expectation: 

Theorem 20.3.2. If R is a random variable defined on a sample space, S, then

\[
\mathrm{E}[R] = \sum_{\omega \in S} R(\omega) \Pr\{\omega\} \tag{20.2}
\]

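The proof of Theorem 20.3.2, like many of the elementary proofs about
expectation in this chapter, follows by judicious regrouping of terms in the defining
sum (20.1):

Proof.

\[
\begin{aligned}
\mathrm{E}[R] &::= \sum_{x \in \operatorname{range}(R)} x \cdot \Pr\{R = x\}
    && \text{(Def 20.3.1 of expectation)} \\
  &= \sum_{x \in \operatorname{range}(R)} x \Bigl( \sum_{\omega \in [R = x]} \Pr\{\omega\} \Bigr)
    && \text{(def of } \Pr\{R = x\}\text{)} \\
  &= \sum_{x \in \operatorname{range}(R)} \sum_{\omega \in [R = x]} x \Pr\{\omega\}
    && \text{(distributing } x \text{ over the inner sum)} \\
  &= \sum_{x \in \operatorname{range}(R)} \sum_{\omega \in [R = x]} R(\omega) \Pr\{\omega\}
    && \text{(def of the event } [R = x]\text{)} \\
  &= \sum_{\omega \in S} R(\omega) \Pr\{\omega\}.
\end{aligned}
\]

The last equality follows because the events [R = x] for x ∈ range(R) partition the
sample space, S, so summing over the outcomes in [R = x] for x ∈ range(R) is the
same as summing over S. ■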

In general, the defining sum (20.1) is better for calculating expected values and
has the advantage that it does not depend on the sample space, but only on the
density function of the random variable. On the other hand, the simpler sum over
all outcomes (20.2) is sometimes easier to use in proofs about expectation.
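To see the two sums agree on a concrete case, the sketch below computes the expected number of heads in three fair coin flips both ways. The names OUTCOMES, PR, and density are chosen for this illustration, and the uniform outcome probabilities are the assumption carried over from the three-coin example in Section 20.1.

```python
from fractions import Fraction
from itertools import product

OUTCOMES = ["".join(flips) for flips in product("HT", repeat=3)]
PR = {w: Fraction(1, 8) for w in OUTCOMES}   # each outcome has probability 1/8

def C(w):
    """Number of heads in outcome w."""
    return w.count("H")

# Sum (20.2): one term per outcome in the sample space.
by_outcomes = sum(C(w) * PR[w] for w in OUTCOMES)

# Sum (20.1): one term per value in range(C), weighted by Pr{C = x}.
density = {}
for w in OUTCOMES:
    density[C(w)] = density.get(C(w), Fraction(0)) + PR[w]
by_values = sum(x * p for x, p in density.items())

print(by_values, by_outcomes)
```

Building density is exactly the regrouping step in the proof: outcomes with the same value of C are collected into one block before multiplying by that value.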



20.3.1 Expected Value of an Indicator Variable 

The expected value of an indicator random variable for an event is just the proba- 
bility of that event. 

Lemma 20.3.3. If I_A is the indicator random variable for event A, then

\[
\mathrm{E}[I_A] = \Pr\{A\}.
\]

Proof.

\[
\begin{aligned}
\mathrm{E}[I_A] &= 1 \cdot \Pr\{I_A = 1\} + 0 \cdot \Pr\{I_A = 0\} \\
                &= \Pr\{I_A = 1\} \\
                &= \Pr\{A\}. && \text{(def of } I_A\text{)}
\end{aligned}
\]

For example, if A is the event that a coin with bias p comes up heads,
E[I_A] = Pr{I_A = 1} = p.

20.3.2 Conditional Expectation

Just like event probabilities, expectations can be conditioned on some event. 

Definition 20.3.4. The conditional expectation, E[R | A], of a random variable, R,
given event, A, is:

\[
\mathrm{E}[R \mid A] ::= \sum_{r \in \operatorname{range}(R)} r \cdot \Pr\{R = r \mid A\}. \tag{20.3}
\]

In other words, it is the average value of the variable R when values are weighted
by their conditional probabilities given A.

For example, we can compute the expected value of a roll of a fair die, given,
for example, that the number rolled is at least 4. We do this by letting R be the
outcome of a roll of the die. Then by equation (20.3),

\[
\mathrm{E}[R \mid R \ge 4] = \sum_{i=1}^{6} i \cdot \Pr\{R = i \mid R \ge 4\}
  = 1 \cdot 0 + 2 \cdot 0 + 3 \cdot 0 + 4 \cdot \tfrac{1}{3} + 5 \cdot \tfrac{1}{3} + 6 \cdot \tfrac{1}{3} = 5.
\]

The power of conditional expectation is that it lets us divide complicated
expectation calculations into simpler cases. We can find the desired expectation by
calculating the conditional expectation in each simple case and averaging them,
weighting each case by its probability.

For example, suppose that 49.8% of the people in the world are male and the 
rest female — which is more or less true. Also suppose the expected height of a 
randomly chosen male is 5' 11", while the expected height of a randomly chosen 
female is 5' 5". What is the expected height of a randomly chosen individual? We 
can calculate this by averaging the heights of men and women. Namely, let H be 
the height (in feet) of a randomly chosen person, and let M be the event that the 
person is male and F the event that the person is female. We have 

\[
\begin{aligned}
\mathrm{E}[H] &= \mathrm{E}[H \mid M] \Pr\{M\} + \mathrm{E}[H \mid F] \Pr\{F\} \\
              &= (5 + 11/12) \cdot 0.498 + (5 + 5/12) \cdot 0.502 \\
              &\approx 5.666
\end{aligned}
\]

which is a little less than 5' 8".

The Law of Total Expectation justifies this method. 

Theorem 20.3.5 (Law of Total Expectation). Let A_1, A_2, ... be a partition of the
sample space. Then

\[
\mathrm{E}[R] = \sum_i \mathrm{E}[R \mid A_i] \Pr\{A_i\}.
\]

Proof.

\[
\begin{aligned}
\mathrm{E}[R] &::= \sum_{r \in \operatorname{range}(R)} r \cdot \Pr\{R = r\}
    && \text{(Def 20.3.1 of expectation)} \\
  &= \sum_r r \cdot \sum_i \Pr\{R = r \mid A_i\} \Pr\{A_i\}
    && \text{(Law of Total Probability)} \\
  &= \sum_r \sum_i r \cdot \Pr\{R = r \mid A_i\} \Pr\{A_i\}
    && \text{(distribute constant } r\text{)} \\
  &= \sum_i \sum_r r \cdot \Pr\{R = r \mid A_i\} \Pr\{A_i\}
    && \text{(exchange order of summation)} \\
  &= \sum_i \Pr\{A_i\} \sum_r r \cdot \Pr\{R = r \mid A_i\}
    && \text{(factor constant } \Pr\{A_i\}\text{)} \\
  &= \sum_i \Pr\{A_i\} \, \mathrm{E}[R \mid A_i].
    && \text{(Def 20.3.4 of cond. expectation)}
\end{aligned}
\]
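As a quick sanity check of the Law of Total Expectation, the sketch below splits a fair die roll into the two cases "at least 4" and "less than 4" and recombines the conditional expectations. The helper names pr and cond_expectation are made up for this example, and events are represented as predicates on the rolled number.

```python
from fractions import Fraction

DIE = range(1, 7)
PR_FACE = Fraction(1, 6)          # each face of a fair die

def pr(event):
    """Pr{A} for an event given as a predicate on the rolled number."""
    return sum((PR_FACE for r in DIE if event(r)), Fraction(0))

def cond_expectation(event):
    """E[R | A]: sum of r * Pr{R = r | A} over the faces r in A."""
    pa = pr(event)
    return sum((r * (PR_FACE / pa) for r in DIE if event(r)), Fraction(0))

# A partition of the sample space into two blocks.
partition = [lambda r: r >= 4, lambda r: r < 4]

total = sum(cond_expectation(a) * pr(a) for a in partition)
print(cond_expectation(partition[0]), cond_expectation(partition[1]), total)
```

The two conditional expectations are 5 and 2, each weighted by probability 1/2, which recovers the unconditional expectation 7/2 computed earlier.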



20.3.3 Mean Time to Failure 

A computer program crashes at the end of each hour of use with probability p, if it 
has not crashed already. What is the expected time until the program crashes? 

If we let C be the number of hours until the crash, then the answer to our
problem is E[C]. Now the probability that, for i > 0, the first crash occurs in the
ith hour is the probability that it does not crash in each of the first i − 1 hours
and it does crash in the ith hour, which is (1 − p)^(i−1) p. So from formula (20.1) for
expectation, we have

\[
\begin{aligned}
\mathrm{E}[C] &= \sum_{i=1}^{\infty} i \cdot \Pr\{C = i\} \\
              &= p \sum_{i=1}^{\infty} i (1 - p)^{i-1} \\
              &= p \cdot \frac{1}{(1 - (1 - p))^2} && \text{(by (17.1))} \\
              &= \frac{1}{p}.
\end{aligned}
\]
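The closed form can be checked numerically by truncating the infinite series. This is only a sketch: the function name expected_crash_time and the cutoff of 10,000 terms are arbitrary choices for the illustration, and the result is a floating-point approximation rather than an exact value.

```python
def expected_crash_time(p, terms=10_000):
    """Partial sum of the series: sum over i >= 1 of i * (1 - p)^(i - 1) * p."""
    return sum(i * (1 - p) ** (i - 1) * p for i in range(1, terms + 1))

for p in (0.5, 0.1, 0.01):
    print(p, expected_crash_time(p), 1 / p)
```

For these values of p the terms beyond the cutoff are negligible, so the partial sums agree with 1/p to many decimal places.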



A simple alternative derivation that does not depend on the formula (17.1)
(which you remembered, right?) is based on conditional expectation. Given that
the computer crashes in the first hour, the expected number of hours to the first
crash is obviously 1! On the other hand, given that the computer does not crash
in the first hour, then the expected total number of hours till the first crash is the
expectation of one plus the number of additional hours to the first crash. So,

\[
\mathrm{E}[C] = p \cdot 1 + (1 - p)\,\mathrm{E}[C + 1] = p + \mathrm{E}[C] - p\,\mathrm{E}[C] + 1 - p,
\]

from which we immediately calculate that E[C] = 1/p.

So, for example, if there is a 1% chance that the program crashes at the end of
each hour, then the expected time until the program crashes is 1/0.01 = 100 hours.

As a further example, suppose a couple really wants to have a baby girl. For
simplicity assume there is a 50% chance that each child they have is a girl, and the
genders of their children are mutually independent. If the couple insists on having
children until they get a girl, then how many baby boys should they expect first?

This is really a variant of the previous problem. The question, "How many
hours until the program crashes?" is mathematically the same as the question,
"How many children must the couple have until they get a girl?" In this case, a
crash corresponds to having a girl, so we should set p = 1/2. By the preceding
analysis, the couple should expect a baby girl after having 1/p = 2 children. Since
the last of these will be the girl, they should expect just one boy.

Something to think about: If every couple follows the strategy of having
children until they get a girl, what will eventually happen to the fraction of girls
born in this world?

20.3.4 Linearity of Expectation

Expected values obey a simple, very helpful rule called Linearity of Expectation. Its
simplest form says that the expected value of a sum of random variables is the sum
of the expected values of the variables.

Theorem 20.3.6. For any random variables R_1 and R_2,

\[
\mathrm{E}[R_1 + R_2] = \mathrm{E}[R_1] + \mathrm{E}[R_2].
\]

Proof. Let T ::= R_1 + R_2. The proof follows straightforwardly by rearranging terms
in the sum (20.2):

\[
\begin{aligned}
\mathrm{E}[T] &= \sum_{\omega \in S} T(\omega) \Pr\{\omega\}
    && \text{(Theorem 20.3.2)} \\
  &= \sum_{\omega \in S} \bigl( R_1(\omega) + R_2(\omega) \bigr) \Pr\{\omega\}
    && \text{(def of } T\text{)} \\
  &= \sum_{\omega \in S} R_1(\omega) \Pr\{\omega\} + \sum_{\omega \in S} R_2(\omega) \Pr\{\omega\}
    && \text{(rearranging terms)} \\
  &= \mathrm{E}[R_1] + \mathrm{E}[R_2].
    && \text{(Theorem 20.3.2)}
\end{aligned}
\]



A small extension of this proof, which we leave to the reader, implies

Theorem 20.3.7 (Linearity of Expectation). For random variables R_1, R_2 and constants
a_1, a_2 ∈ ℝ,

\[
\mathrm{E}[a_1 R_1 + a_2 R_2] = a_1 \mathrm{E}[R_1] + a_2 \mathrm{E}[R_2].
\]

In other words, expectation is a linear function. A routine induction extends
the result to more than two variables:

Corollary 20.3.8. For any random variables R_1, ..., R_k and constants a_1, ..., a_k ∈ ℝ,

\[
\mathrm{E}\Bigl[ \sum_{i=1}^{k} a_i R_i \Bigr] = \sum_{i=1}^{k} a_i \mathrm{E}[R_i].
\]



The great thing about linearity of expectation is that no independence is required. 
This is really useful, because dealing with independence is a pain, and we often 
need to work with random variables that are not independent. 

Expected Value of Two Dice 

What is the expected value of the sum of two fair dice? 

Let the random variable R_1 be the number on the first die, and let R_2 be the
number on the second die. We observed earlier that the expected value of one die
is 3.5. We can find the expected value of the sum using linearity of expectation:

\[
\mathrm{E}[R_1 + R_2] = \mathrm{E}[R_1] + \mathrm{E}[R_2] = 3.5 + 3.5 = 7.
\]

Notice that we did not have to assume that the two dice were independent.
The expected sum of two dice is 7, even if they are glued together (provided each
individual die remains fair after the gluing). Proving that this expected sum is
7 with a tree diagram would be a bother: there are 36 cases. And if we did not
assume that the dice were independent, the job would be really tough!
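The "glued dice" remark can be made concrete with two joint distributions. In the sketch below, the dictionaries independent and glued are hypothetical models chosen for illustration: the first makes all 36 pairs equally likely, while the second forces the two dice to always show the same face, which is one way the dice could be glued while each stays fair.

```python
from fractions import Fraction

FACES = range(1, 7)

# Independent dice: all 36 pairs (a, b) are equally likely.
independent = {(a, b): Fraction(1, 36) for a in FACES for b in FACES}

# Glued dice: the second die always shows the same face as the first.
glued = {(a, a): Fraction(1, 6) for a in FACES}

def expected_sum(joint):
    """E[R_1 + R_2] computed directly from a joint distribution."""
    return sum((a + b) * p for (a, b), p in joint.items())

print(expected_sum(independent), expected_sum(glued))
```

Both joint distributions give each die the same fair marginal distribution, so linearity predicts the same expected sum in both cases even though the glued dice are as dependent as they could be.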



The Hat-Check Problem 

There is a dinner party where n men check their hats. The hats are mixed up during 
dinner, so that afterward each man receives a random hat. In particular, each man 
gets his own hat with probability 1/n. What is the expected number of men who 
get their own hat? 

Letting G be the number of men that get their own hat, we want to find the 
expectation of G. But all we know about G is that the probability that a man gets 
his own hat back is 1 /n. There are many different probability distributions of hat 
permutations with this property, so we don't know enough about the distribution 
of G to calculate its expectation directly. But linearity of expectation makes the 
problem really easy. 

The trick is to express G as a sum of indicator variables. In particular, let Gi be 
an indicator for the event that the zth man gets his own hat. That is, Gi = 1 if he 
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gets his own hat, and d = otherwise. The number of men that get their own hat 
is the sum of these indicators: 



G = d + G 2 + ■ • ■ + G n . (20.4) 

These indicator variables are not mutually independent. For example, if n — 1 men 
all get their own hats, then the last man is certain to receive his own hat. But, since 
we plan to use linearity of expectation, we don't have worry about independence! 

Now since G_i is an indicator, we know 1/n = Pr{G_i = 1} = E[G_i] by Lemma 20.3.3.
Now we can take the expected value of both sides of equation (20.4) and apply
linearity of expectation:

E[G] = E[G_1 + G_2 + \cdots + G_n]
     = E[G_1] + E[G_2] + \cdots + E[G_n]
     = 1/n + 1/n + \cdots + 1/n = n \cdot (1/n) = 1.

So even though we don't know much about how hats are scrambled, we've figured 
out that on average, just one man gets his own hat back! 
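As a sanity check, the expectation can be computed by brute force for two different ways of scrambling hats, both of which give each man his own hat with probability 1/n. This is a sketch under assumptions: the function names are invented for illustration, and the two scrambling models (a uniformly random permutation, and a uniformly random cyclic shift) are just two examples of distributions with the required property.

```python
from fractions import Fraction
from itertools import permutations

def expected_own_hats_uniform(n):
    """Average number of men with their own hat over all n! equally likely permutations."""
    total = 0
    count = 0
    for perm in permutations(range(n)):
        total += sum(1 for man, hat in enumerate(perm) if man == hat)
        count += 1
    return Fraction(total, count)

def expected_own_hats_cyclic(n):
    """Same average when man i gets hat (i + s) mod n for a uniformly random shift s."""
    total = 0
    for s in range(n):
        total += sum(1 for i in range(n) if (i + s) % n == i)
    return Fraction(total, n)

for n in range(1, 7):
    print(n, expected_own_hats_uniform(n), expected_own_hats_cyclic(n))
```

Under the cyclic model either everyone or no one gets his own hat, so G has a very different distribution than under the uniform model, yet both averages come out to 1.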

Expectation of a Binomial Distribution 

Suppose that we independently flip n biased coins, each with probability p of com- 
ing up heads. What is the expected number that come up heads? 

Let J be the number of heads after the flips, so J has the (n,p)-binomial
distribution. Now let I_k be the indicator for the kth coin coming up heads. By
Lemma 20.3.3, we have

E[I_k] = p.

But

J = \sum_{k=1}^{n} I_k,

so by linearity

E[J] = E[\sum_{k=1}^{n} I_k] = \sum_{k=1}^{n} E[I_k] = \sum_{k=1}^{n} p = pn.

In short, the expectation of an (n,p)-binomially distributed variable is pn.
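The result pn can be compared against the expectation computed directly from the binomial probabilities. The following is a minimal sketch; the function name `binomial_expectation` and the sample values n = 10, p = 1/3 are chosen here purely for illustration.

```python
from fractions import Fraction
from math import comb

def binomial_expectation(n, p):
    """E[J] computed term by term from Pr{J = k} = C(n, k) p^k (1 - p)^(n - k)."""
    return sum(k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))

n, p = 10, Fraction(1, 3)
print(binomial_expectation(n, p))  # 10/3
print(n * p)                       # 10/3
```

The direct sum needs all n + 1 probabilities, while the indicator argument needs only the single fact E[I_k] = p.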



The Coupon Collector Problem 

Every time I purchase a kid's meal at Taco Bell, I am graciously presented with 
a miniature "Racin' Rocket" car together with a launching device which enables 
me to project my new vehicle across any tabletop or smooth floor at high velocity. 
Truly, my delight knows no bounds. 

There are n different types of Racin' Rocket car (blue, green, red, gray, etc.). The
type of car awarded to me each day by the kind woman at the Taco Bell register
appears to be selected uniformly and independently at random. What is the
expected number of kid's meals that I must purchase in order to acquire at least one
of each type of Racin' Rocket car?

The same mathematical question shows up in many guises: for example, what 
is the expected number of people you must poll in order to find at least one person 
with each possible birthday? Here, instead of collecting Racin' Rocket cars, you're 
collecting birthdays. The general question is commonly called the coupon collector 
problem after yet another interpretation. 

A clever application of linearity of expectation leads to a simple solution to the 
coupon collector problem. Suppose there are five different types of Racin' Rocket, 
and I receive this sequence: 

blue green green red blue orange blue orange gray 

Let's partition the sequence into 5 segments:

blue | green | green red | blue orange | blue orange gray



The rule is that a segment ends whenever I get a new kind of car. For example, the 
middle segment ends when I get a red car for the first time. In this way, we can 
break the problem of collecting every type of car into stages. Then we can analyze 
each stage individually and assemble the results using linearity of expectation. 
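The segmentation rule can be written out as a short procedure. This is only an illustrative sketch: `split_into_segments` is a name made up here, and it assumes that any cars received after the last new type are simply ignored.

```python
def split_into_segments(cars):
    """Cut the sequence right after each car whose type has not been seen before."""
    seen = set()
    segments = []
    current = []
    for car in cars:
        current.append(car)
        if car not in seen:
            # A new type ends the current segment.
            seen.add(car)
            segments.append(current)
            current = []
    # Assumption: cars after the last new type do not form a segment.
    return segments

cars = "blue green green red blue orange blue orange gray".split()
segments = split_into_segments(cars)
print(segments)
print([len(s) for s in segments])  # [1, 1, 2, 2, 3]
```

The segment lengths add up to 9, the total number of cars received in the example sequence.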

Let's return to the general case where I'm collecting n Racin' Rockets. Let X_k
be the length of the kth segment. The total number of kid's meals I must purchase
to get all n Racin' Rockets is the sum of the lengths of all these segments:

T = X_0 + X_1 + \cdots + X_{n-1}

Now let's focus our attention on X_k, the length of the kth segment. At the
beginning of segment k, I have k different types of car, and the segment ends when
I acquire a new type. When I own k types, each kid's meal contains a type that I
already have with probability k/n. Therefore, each meal contains a new type of car
with probability 1 - k/n = (n - k)/n. Thus, the expected number of meals until
I get a new kind of car is n/(n - k) by the "mean time to failure" formula. So we
have:

E[X_k] = n/(n - k)

Linearity of expectation, together with this observation, solves the coupon
collector problem:

E[T] = E[X_0 + X_1 + \cdots + X_{n-1}]
     = E[X_0] + E[X_1] + \cdots + E[X_{n-1}]
     = n/(n-0) + n/(n-1) + \cdots + n/3 + n/2 + n/1
     = n (1/n + 1/(n-1) + \cdots + 1/3 + 1/2 + 1/1)
     = n (1/1 + 1/2 + 1/3 + \cdots + 1/(n-1) + 1/n)
     = n H_n \sim n \ln n.



Let's use this general solution to answer some concrete questions. For example, 
the expected number of die rolls required to see every number from 1 to 6 is: 

6 H_6 = 14.7

And the expected number of people you must poll to find at least one person with
each possible birthday is:

365 H_{365} = 2364.6...
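These two figures can be reproduced by summing the per-segment expectations n/(n - k) exactly. A minimal sketch, with `expected_purchases` as an illustrative name not taken from the text:

```python
from fractions import Fraction

def expected_purchases(n):
    """E[T] = sum over k = 0..n-1 of n / (n - k), which equals n * H_n."""
    return sum(Fraction(n, n - k) for k in range(n))

print(float(expected_purchases(6)))    # 14.7
print(float(expected_purchases(365)))  # roughly 2364.6
```

Exact fractions are used so that the die-roll answer comes out as exactly 147/10 rather than a rounded float.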



20.3.5 The Expected Value of a Product 

While the expectation of a sum is the sum of the expectations, the same is usually 
not true for products. But it is true in an important special case, namely, when the 
random variables are independent. 

For example, suppose we throw two independent, fair dice and multiply the 
numbers that come up. What is the expected value of this product? 

Let random variables R_1 and R_2 be the numbers shown on the two dice. We
can compute the expected value of the product as follows:

E[R_1 \cdot R_2] = E[R_1] \cdot E[R_2] = 3.5 \cdot 3.5 = 12.25. (20.5)



Here the first equality holds because the dice are independent. 

At the other extreme, suppose the second die is always the same as the first.
Now R_1 = R_2, and we can compute the expectation, E[R_1^2], of the product of the
dice explicitly, confirming that it is not equal to the product of the expectations.

E[R_1 \cdot R_2] = E[R_1^2]
                 = \sum_{i=1}^{6} i^2 \cdot Pr{R_1^2 = i^2}
                 = 1^2/6 + 2^2/6 + 3^2/6 + 4^2/6 + 5^2/6 + 6^2/6
                 = 15 1/6
                 \neq 12 1/4
                 = E[R_1] \cdot E[R_2].
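Both extremes can be verified by enumerating outcomes. The sketch below uses illustrative variable names (`independent`, `identical`) that do not appear in the text.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)

# Independent fair dice: each of the 36 ordered pairs has probability 1/36.
independent = sum(Fraction(a * b, 36) for a, b in product(FACES, FACES))

# Second die always equals the first: each face a has probability 1/6,
# and the product is a * a.
identical = sum(Fraction(a * a, 6) for a in FACES)

print(independent)  # 49/4, that is 12 1/4
print(identical)    # 91/6, that is 15 1/6
```

Only the independent case matches the product of the individual expectations, 3.5 * 3.5.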





Theorem 20.3.9. For any two independent random variables R_1, R_2,

E[R_1 \cdot R_2] = E[R_1] \cdot E[R_2].

Proof. The event [R_1 \cdot R_2 = r] can be split up into events of the form [R_1 =
r_1 and R_2 = r_2] where r_1 \cdot r_2 = r. So

E[R_1 \cdot R_2]

::= \sum_{r \in range(R_1 \cdot R_2)} r \cdot Pr{R_1 \cdot R_2 = r}

= \sum_{r_i \in range(R_i)} r_1 r_2 \cdot Pr{R_1 = r_1 and R_2 = r_2}

= \sum_{r_1 \in range(R_1)} \sum_{r_2 \in range(R_2)} r_1 r_2 \cdot Pr{R_1 = r_1 and R_2 = r_2}   (ordering terms in the sum)

= \sum_{r_1 \in range(R_1)} \sum_{r_2 \in range(R_2)} r_1 r_2 \cdot Pr{R_1 = r_1} \cdot Pr{R_2 = r_2}   (indep. of R_1, R_2)

= \sum_{r_1 \in range(R_1)} ( r_1 Pr{R_1 = r_1} \cdot \sum_{r_2 \in range(R_2)} r_2 Pr{R_2 = r_2} )   (factoring out r_1 Pr{R_1 = r_1})

= \sum_{r_1 \in range(R_1)} r_1 Pr{R_1 = r_1} \cdot E[R_2]   (def of E[R_2])

= E[R_2] \cdot \sum_{r_1 \in range(R_1)} r_1 Pr{R_1 = r_1}   (factoring out E[R_2])

= E[R_2] \cdot E[R_1].   (def of E[R_1])



Theorem 20.3.9 extends routinely to a collection of mutually independent
variables.

Corollary 20.3.10. If random variables R_1, R_2, ..., R_k are mutually independent, then

E[\prod_{i=1}^{k} R_i] = \prod_{i=1}^{k} E[R_i].
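For a concrete instance of the corollary, take k mutually independent fair dice and compare the expected product with 3.5^k. This is a brute-force sketch for small k only; `expected_product` is a name introduced here for illustration.

```python
from fractions import Fraction
from itertools import product
from math import prod

FACES = range(1, 7)

def expected_product(k):
    """E[R_1 * ... * R_k] for k mutually independent fair dice, by enumeration."""
    outcomes = list(product(FACES, repeat=k))  # all 6^k equally likely outcomes
    return Fraction(sum(prod(rolls) for rolls in outcomes), len(outcomes))

single = Fraction(7, 2)  # expected value of one fair die
for k in range(1, 5):
    print(k, expected_product(k), single**k)
```

The two columns printed for each k agree, as the corollary predicts.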



20.3.6 Problems 
Practice Problems 
Problem 20.4. 

MIT students sometimes delay laundry for a few days. Assume all random values 
described below are mutually independent. 

(a) A busy student must complete 3 problem sets before doing laundry. Each 
problem set requires 1 day with probability 2/3 and 2 days with probability 1/3. 
Let B be the number of days a busy student delays laundry. What is E[B]?

Example: If the first problem set requires 1 day and the second and third problem 
sets each require 2 days, then the student delays for B = 5 days. 

(b) A relaxed student rolls a fair, 6-sided die in the morning. If he rolls a 1, then he 
does his laundry immediately (with zero days of delay). Otherwise, he delays for 
one day and repeats the experiment the following morning. Let R be the number 
of days a relaxed student delays laundry. What is E [R]? 

Example: If the student rolls a 2 the first morning, a 5 the second morning, and a 1 
the third morning, then he delays for R = 2 days. 

(c) Before doing laundry, an unlucky student must recover from illness for a num- 
ber of days equal to the product of the numbers rolled on two fair, 6-sided dice. 
Let U be the number of days an unlucky student delays laundry. What is E[U]?

Example: If the rolls are 5 and 3, then the student delays for U = 15 days. 

(d) A student is busy with probability 1/2, relaxed with probability 1/3, and
unlucky with probability 1/6. Let D be the number of days the student delays laundry.
What is E[D]?



Problem 20.5. 

Each 6.042 final exam will be graded according to a rigorous procedure: 

• With probability | the exam is graded by a TA, with probability | it is graded
by a lecturer, and with probability |, it is accidentally dropped behind the 
radiator and arbitrarily given a score of 84. 

• TAs score an exam by scoring each problem individually and then taking the
sum.

- There are ten true/false questions worth 2 points each. For each, full
credit is given with probability |, and no credit is given with probability 

I - 

- There are four questions worth 15 points each. For each, the score is 
determined by rolling two fair dice, summing the results, and adding 3. 

- The single 20 point question is awarded either 12 or 18 points with equal 
probability. 

• Lecturers score an exam by rolling a fair die twice, multiplying the results, 
and then adding a "general impression" score. 

- With probability ^, the general impression score is 40. 

- With probability ^, the general impression score is 50. 

- With probability f^, the general impression score is 60. 

Assume all random choices during the grading process are independent. 

(a) What is the expected score on an exam graded by a TA? 

(b) What is the expected score on an exam graded by a lecturer? 

(c) What is the expected score on a 6.042 final exam? 

Class Problems 
Problem 20.6. 

Let's see what it takes to make Carnival Dice fair. Here's the game with payoff 
parameter k: make three independent rolls of a fair die. If you roll a six 

• no times, then you lose 1 dollar. 

• exactly once, then you win 1 dollar. 

• exactly twice, then you win two dollars. 

• all three times, then you win k dollars. 
For what value of k is this game fair? 

Problem 20.7. 

A classroom has sixteen desks arranged as shown below.

If there is a girl in front, behind, to the left, or to the right of a boy, then the two of
them flirt. One student may be in multiple flirting couples; for example, a student 
in a corner of the classroom can flirt with up to two others, while a student in the 
center can flirt with as many as four others. Suppose that desks are occupied by 
boys and girls with equal probability and mutually independently. What is the 
expected number of flirting couples? Hint: Linearity. 



Problem 20.8. 

Here are seven propositions: 



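x_1 OR x_3 OR \overline{x_7}
\overline{x_5} OR x_6 OR x_7
x_2 OR \overline{x_4} OR x_6
\overline{x_4} OR x_5 OR \overline{x_7}
x_3 OR \overline{x_5} OR \overline{x_8}
x_9 OR \overline{x_8} OR x_2
\overline{x_3} OR x_9 OR x_4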



Note that: 

1. Each proposition is the disjunction (OR) of three terms of the form x_i or the
form \overline{x_i}.

2. The variables in the three terms in each proposition are all different. 

Suppose that we assign true/false values to the variables x_1, ..., x_9
independently and with equal probability.

(a) What is the expected number of true propositions?

Hint: Let T_i be an indicator for the event that the i-th proposition is true.



(b) Use your answer to prove that for any set of 7 propositions satisfying the con- 
ditions 1. and 2., there is an assignment to the variables that makes all 7 of the 
propositions true. 



Problem 20.9. (a) Suppose we flip a fair coin until two Tails in a row come up.
What is the expected number, N_TT, of flips we perform? Hint: Let D be the tree
diagram for this process. Explain why D = H \cdot D + T \cdot (H \cdot D + T). Use the Law
of Total Expectation 20.3.5.



(b) Suppose we flip a fair coin until a Tail immediately followed by a Head comes
up. What is the expected number, N_TH, of flips we perform?



(c) Suppose we now play a game: flip a fair coin until either TT or TH first occurs. 
You win if TT comes up first, lose if TH comes up first. Since TT takes 50% longer 
on average to turn up, your opponent agrees that he has the advantage. So you 
tell him you're willing to play if you pay him $5 when he wins, but he merely pays 
you a 20% premium, that is, $6, when you win. 



If you do this, you're sneakily taking advantage of your opponent's untrained in- 
tuition, since you've gotten him to agree to unfair odds. What is your expected 
profit per game? 



Problem 20.10. 

Justify each line of the following proof that if R_1 and R_2 are independent, then

E[R_1 \cdot R_2] = E[R_1] \cdot E[R_2].

Proof.

E[R_1 \cdot R_2]

= \sum_{r \in range(R_1 \cdot R_2)} r \cdot Pr{R_1 \cdot R_2 = r}

= \sum_{r_i \in range(R_i)} r_1 r_2 \cdot Pr{R_1 = r_1 and R_2 = r_2}

= \sum_{r_1 \in range(R_1)} \sum_{r_2 \in range(R_2)} r_1 r_2 \cdot Pr{R_1 = r_1 and R_2 = r_2}

= \sum_{r_1 \in range(R_1)} \sum_{r_2 \in range(R_2)} r_1 r_2 \cdot Pr{R_1 = r_1} \cdot Pr{R_2 = r_2}

= \sum_{r_1 \in range(R_1)} ( r_1 Pr{R_1 = r_1} \cdot \sum_{r_2 \in range(R_2)} r_2 Pr{R_2 = r_2} )

= \sum_{r_1 \in range(R_1)} r_1 Pr{R_1 = r_1} \cdot E[R_2]

= E[R_2] \cdot \sum_{r_1 \in range(R_1)} r_1 Pr{R_1 = r_1}

= E[R_2] \cdot E[R_1].



Problem 20.11. 

Here are seven propositions: 



x_1 \vee x_3 \vee \neg x_7
\neg x_5 \vee x_6 \vee x_7
x_2 \vee \neg x_4 \vee x_6
\neg x_4 \vee x_5 \vee \neg x_7
x_3 \vee \neg x_5 \vee \neg x_8
x_9 \vee \neg x_8 \vee x_2
\neg x_3 \vee x_9 \vee x_4



Note that: 

1. Each proposition is the OR of three terms of the form x_i or the form \neg x_i.

2. The variables in the three terms in each proposition are all different. 

Suppose that we assign true/false values to the variables x_1, ..., x_9 independently
and with equal probability.

(a) What is the expected number of true propositions?

(b) Use your answer to prove that there exists an assignment to the variables that
makes all of the propositions true.



Problem 20.12. 

A literal is a propositional variable or its negation. A k-clause is an OR of k literals,
with no variable occurring more than once in the clause. For example,

P OR Q OR R OR V, 

is a 4-clause, but 

V OR Q OR X OR V, 

is not, since V appears twice. 

Let S be a set of n distinct k-clauses involving v variables. The variables in
different k-clauses may overlap or be completely different, so k \le v \le nk.

A random assignment of true/false values will be made independently to each
of the v variables, with true and false assignments equally likely. Write formulas
in n, k, and v in answer to the first two parts below.

(a) What is the probability that the last k-clause in S is true under the random
assignment?

(b) What is the expected number of true k-clauses in S?

(c) A set of propositions is satisfiable iff there is an assignment to the variables 
that makes all of the propositions true. Use your answer to part (b) to prove that if 
n < 2^k, then S is satisfiable.



Problem 20.13. 

A gambler bets $10 on "red" at a roulette table (the odds of red are 18/38, which
is slightly less than even) to win $10. If he wins, he gets back twice the amount of his
bet and he quits. Otherwise, he doubles his previous bet and continues. 

(a) What is the expected number of bets the gambler makes before he wins? 

(b) What is his probability of winning? 

(c) What is his expected final profit (amount won minus amount lost)? 

(d) The fact that the gambler's expected profit is positive, despite the fact that
the game is biased against him, is known as the St. Petersburg paradox. The
paradox arises from an unrealistic, implicit assumption about the gambler's money.
Explain.

Hint: What is the expected size of his last bet?

Homework Problems
Problem 20.14. 

Let R and S be independent random variables, and f and g be any functions such
that domain(f) = codomain(R) and domain(g) = codomain(S). Prove that f(R)
and g(S) are independent random variables. Hint: The event [f(R) = a] is the
disjoint union of all the events [R = r] for r such that f(r) = a.